A comparative review of dimension reduction methods 
in approximate Bayesian computation 

M. G. B. Blum*} M. A. Nunes} D. Prangle* and S. A. Sisson§ 



Abstract 

Approximate Bayesian computation (ABC) methods make use of comparisons between 
simulated and observed summary statistics to overcome the problem of a computa- 
tionally intractable likelihood functions. The choice of summary statistics involves a 
tradeoff between informativeness and goodness of fit because a larger set of summary 
statistics is more informative but also more difficult to match with the observed set of 
summary statistics. In this article we provide a comprehensive review and comparison 
of the performance of the principal methods of dimension reduction proposed in the 
ABC literature. The methods arc split into three non-mutually exclusive classes con- 
sisting of best subset selection methods, projection techniques and regularisation. In 
addition, we introduce two new methods of dimension reduction. The first is a best sub- 
set selection method based on Akaike and Bayesian information criteria, and the second 
uses ridge regression as a regularisation procedure. We illustrate the performance of 
these dimension reduction techniques through the analysis of three challenging models 
and datasets. 
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1 Introduction 



Bayesian inference is typically focused on the posterior distribution p{9\yobs) oc p{yobs\9)p{(^) 
of a parameter vector 6 E Q M"^, q > 1, representing the updating of one's prior beliefs, 
p{9), through the likelihood (model) function, p{yobs\(^), having observed data yobs £ 3^- The 
term approximate Bayesian computation (ABC) refers to a family of models and algorithms 
that aim to draw samples from an approximate posterior distribution when the likelihood, 
PiVobsl^), is unavailable or computationally intractable, but where it feasible to quickly gener- 
ate data from the model, y ~ p{-\0)- ABC is rapidly becoming a popular tool for the analysis 
of complex statistical models in an increasing number and breadth of research areas. See 



e.g. Lopes and Beaumont (2009) Bertorelle et al. (2010), Beaumont (2010), Csillery et al. 



(2010) and Sisson and Fan (2011) for a partial overview of the application of ABC methods. 

ABC introduces two principal approximations to the posterior distribution. Firstly, 
the posterior distribution of the full dataset, p{0\yobs), is approximated by p{9\sobs) oc 
p{sobs\0)p{0), where Sobs = S{yobs) is a vector of summary statistics of lower dimension than 
the data yobs- In this manner, p{9\sobs) ~ p{9\yobs) is a good approximation if Sobs is highly 
informative for the model parameters, and p{0\sobs) = p{6\yobs) if Sobs is exactly sufficient. 
As p{sobs\G) is also likely to be computationally intractable if p{iiobs\G) is computationally 
intractable, a second approximation is constructed as PABc{(^\sobs) = / p(^5 s\sobs)ds, with 

p{e, s\sobs) cc K,i\\s - Sobs\\)pis\e)p{e), (i) 

where i^e(||M||) = K{\\u\\/e)/e is a standard smoothing kernel with scale parameter e > 0. 
As a result, PABc{0\sobs) ~ p{d\sobs) is a good approximation if the kernel scale parameter. 



e, is small, following standard kernel density estimation arguments (e.g. Blum 2010a). 

In combination, both approximations allow for practical methods of sampling from PABc{(^\sobs) 
that avoid explicit evaluation of the intractable likelihood function, p{yobs\6)- A simple 
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rejection-sampling algorithm to achieve this was proposed by Pritchard et al. (1999) (see 



also Marjoram et al. 2003), which produces draws from p{6,s\sobs)- In general terms, an 



importance-sampling version of this algorithm proceeds as follows: 

1. Draw a candidate parameter vector from the prior, 6' ~ p{0)', 

2. Draw summary statistics from the model s' ~ p{s\6'); 

3. Assign to {9', s') a weight, w', that is proportional to -ft'edls' — Sofes||). 

Here, the sampling distribution for {0',s') is the prior predictive distribution, p{s\6)p{6), 
and the target distribution is p{9,s\sobs) given by ([T|. As such, the importance weight for 
the pair {9',s') is proportional to p{0' , s'\sobs) /[p{s'\9')p{9')] = K^{\\s' — Sobs||), which is free 
of intractable likelihood terms, p{s'\9'). The manner by which the intractable likelihoods 
cancel between sampling and target distributions forms the basis for the majority of ABC 
algorithms. 

Clearly, both ABC approximations to the posterior distribution help to avoid the com- 
putational intractability of the original problem. The first approximation allows the kernel 
weighting of the second approximation, i^^^dl-s — Sobs||), to be performed on a lower dimension 
than that of the original data, yobs- Kernel smoothing is known to suffer from the curse of 



dimensionality (e.g. Blum 2010a), and so keeping dim(s) < dim(?/) as small as possible helps 
to improve algorithmic efficiency. The second approximation ([T]) allows the sampler weights 
(or acceptance probabilities, if one considers rejection-based samplers, such as Markov chain 
Monte Carlo) to be free of intractable likelihood terms. 

In practice, however, there is typically a tradeoff between the two approximations: If 
the dimension of s is large so that the first approximation, p{9\sobs) ~ p{9\yobs) is good, the 
second approximation may then be poor due to the inefficiency of kernel smoothing in large 
dimensions. Conversely, if the dimension of s is small, while the second approximation ([T]) 
will be good (with a small kernel scale parameter, e), any loss of information in the mapping 



Sobs = S{yobs) means that the first approximation may be poor. Naturally, a low-dimensional 
and near-sufficient statistic, s, would provide a near-optimal and balanced choice. 

For a given set of summary statistics, much work has been done on deriving more ef- 
ficient sampling algorithms to reduce the effect of the second approximation by allowing a 
smaller value for the kernel scale parameter, e, which in turn improves the approximation 
PABc{^\sobs) ~ p{0\sobs)- The greater the algorithmic efficiency, the smaller the scale pa- 
rameter that can be achieved for a given computational burden. These algorithms include 



Markov chain Monte Carlo (Marjoram et al. 2003 Bortot et al. 2007) and sequential Monte 



Carlo techniques (Sisson et al. 2007 Toni et al. 2009 Beaumont et al. 2009 Drovandi and 



Pettitt 2011 Peters et al. 2012 Del Moral et al. 2012). By contrast, the regression-based 



methods described in Section [271] do not aim at reducing the scale parameter e but rather ex- 
plicitly account for the imperfect match between observed and simulated summary statistics 



(Beaumont et al. 2002 Blum and Frangois 2010), 

Achieving a good tradeoff between the two approximations revolves around the identifica- 
tion of a set of summary statistics, s, which are both low- dimensional and highly informative 
for 6. A number of methods, primarily based on dimension reduction ideas, have been pro- 



posed to achieve this (Joyce and Marjoram 2008 Wegmann et al. 2009; Nunes and Balding 



2010 Blum and Frangois 2010 Blum 2010b Fearnhead and Prangle 2012). The choice of 



summary statistics is one of the most important aspect of a statistical analysis using ABC 
methods (along with the choice of algorithm). Poor specification of s can have a large and 
detrimental impact on both ABC model approximations. 

In this article we provide the first detailed review and comparison of the performance of 
the current methods of dimension reduction for summary statistics within the ABC frame- 
work. We characterise these methods into three non-mutually exclusive classes: (i) best 
subset selection, (ii) projection techniques and (iii) regularisation approaches. As part of 
this analysis, we introduce two additional novel techniques for dimension reduction within 
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ABC. The first adopts the ideas of Akaike and Bayesian information criteria to the ABC 
framework, whereas the second makes use of ridge regression as a regularisation procedure 
for ABC. The dimension reduction methods are compared through the analysis of three 
challenging models and datasets. These involve the analysis of a coalescent model with 



recombination (Joyce and Marjoram 2008), an evaluation of the evolutionary fitness cost 



of mutation in drug- resistant tuberculosis (Luciani et al. 2009), and an assessment of the 



number and size-distribution of particle inclusions in the production of clean steels (Bortot 



et al. 2007). 



The layout of this article is as follows: In Section [2] we classify and review the existing 
methods of summary statistic dimension reduction in ABC, and in Section [3} we outline our 
two additional novel methods. A comparative analysis of the performance of each of these 
methods is provided in Section |4j We conclude with a discussion. 

2 Classification of ABC dimension reduction methods 

In a typical ABC analysis, an initial collection of statistics = (si, . . . , Sp) is chosen by 
the modeller, the elements of which have the potential to be informative for the model 
parameters, 9^ = {9i, . . . ,9g). Choice of these initial statistics is highly problem specific, 
and the number of candidate statistics, p, often considerably outnumbers the number of 



model parameters, q i.e. p » q (e.g. Bortot et al. 2007 AUingham et al. 2009 Luciani 



et al. 2009). The analysis then proceeds by either using all p statistics in full, or by 
attempting to reduce their dimension while minimising information loss. Clearly, a poor 
choice of initial summary statistics can result in significant loss of information, even before 
reaching the dimension reduction stage. As such, it may be prudent to construct a very large 
set of initial statistics with the aim of capturing all aspects of the data under the model, or 
simply specify S{y) = y (or more practically for real- valued data, the ordered observations 
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S{y) = . . . , c.f. Bortot et al. 2007 AUingham et al. 2009) so that there is no loss 
of information at this stage. Note also that the most suitable set of summary statistics for an 
analysis is likely to be dataset dependent (an exception is when low dimensional sufficient 
statistics are known). As such, any analysis should also consider establishing potentially 
different summary statistics when re-implementing any model with a different dataset. 

Methods of summary statistics dimension reduction for ABC can be broadly classified 
into three non-mutually exclusive classes. The first class of methods follows a best subset 
selection approach. Here, candidate subsets are evaluated and ranked according to various 



information-based criteria, such as measures of sufficiency (Joyce and Marjoram 2008) or the 



entropy of the posterior distribution (Nunes and Balding 2010). In this article we contribute 
additional criteria for this process derived from Akaike and Bayesian information criteria 
arguments. From these criteria, the highest ranking subset (or alternatively, a subset con- 
sisting of those summary statistics which demonstrate clear importance) is then chosen for 
the final analysis. 

The second class of methods can be considered as projection techniques. Here, the di- 
mension of (si, . . . ,Sp) is reduced by considering linear or non-linear combinations of the 
summary statistics. These methods make use of a regression layer within the ABC frame- 
work, whereby the response variable, 6, is regressed by the (possibly transformed) predictor 



variables, s, (Beaumont et al. 2002; Blum and Frangois 2010). These projection methods 



include partial least squares regression (Wegmann et al. 2009), feed-forward neural net- 



works (Blum and Frangois 2010) and regression guided by minimum expected posterior loss 



considerations (Fearnhead and Prangle 2012). 

In this article we introduce a third class of methods for dimension reduction in ABC, 
based on regularisation techniques. Using ridge regression, we also make use of the regression 
layer between the parameter 9 and the summary statistics, s. However, rather than explic- 
itly considering selection of summary statistics, we propose to approach this implicitly, by 
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shrinking the regression coefficients towards zero so that uninformative summary statistics 
have the weakest contribution in the regression equation. 

In the remainder of this Section we discuss each of these methods in more detail. We 



first describe the ideas behind ABC regression adjustment strategies (Beaumont et ah 2002 



Blum and Frangois 2010), as many of the dimension reduction techniques build on this 



framework. 

2.1 Regression adjustment in ABC 

Standard ABC methods suffer from the curse of dimensionality in that the rate of conver- 
gence of posterior expectations with respect to PABc{(^\sobs) (such as the Nadaraya- Watson 
estimator of the posterior mean) decreases dramatically as the dimension of the summary 



statistics, p, increases (Blum 2010a). ABC regression adjustment (Beaumont et al. 2002) 
aims to avoid this by explicitly modelling the discrepancy between s and Sobs- When describ- 
ing regression adjustment methods, for notational simplicity and clarity of exposition, we 
assume that the parameter of interest, 9, is univariate (i.e. g = 1). Regression adjustment 
methods may be readily applied to multivariate 6, by using a different regression equation 
for each parameter, 9i, . . . ,9q, separately. 

The simplest model for this is a homoscedastic regression in the region of Sots, so that 

9' = m{s') + e\ 

where (^*,s*) ~ p{s\9)p{9) are i = l,...,n draws from the prior predictive distribution, 
m{s^) = E,[9\s = s*] is the mean function, and the e* are zero-mean random variates with 



common variance. To estimate the conditional mean m(-), Beaumont et al. (2002) assumed 
a linear model 

m{s') = a + s' (2) 
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in the neighborhood of Sobs- An estimate of the mean function, m(-), is obtained by mini- 
mizing the weighted least squares criterion J27=i w*||m(s*) — O^W^ where = K^{\\s^ — So^H). 
A weighted sample from the posterior distribution, pABciO\sobs) is then obtained by the 
adjustment 

e*' = m{sobs) + {e'-m{s')) (3) 

for i = 1, . . . ,n. In the above, the kernel scale parameter e controls the bias- variance tradeoff: 
Increasing e reduces variance by increasing the effective sample size — the number of accepted 
simulations when using a uniform kernel K — but increases bias arising from departures from 



a linear mean function m(-) and homoscedastic error structure (Blum 2010a). 



Blum and Frangois (2010) proposed the more flexible, heteroscedastic model 

e' = m{s') + a{s')e\ (4) 



where (j^(s') = Y[6\s = s*] denotes the conditional variance. This variance is estimated 
using a second regression model for the log of the squared residuals i.e. log(6'* — m(s*))^ = 
log o"^(s*) +7]*, where the 1]^ are independent, zero-mean variates with common variance. The 
equivalent adjustment to ([s]) is then given by 

e*' = m{sobs) + [e' - m{sl] (5) 

where a{s) denotes the estimate of o"(s). The kernel scale parameter, e, plays the same role 
as for the homoscedastic model, except with more flexibility on deviations from homoscedas- 



ticity. Nott et al. (2011) have demonstrated that regression adjustment ABC algorithms 
produce samples, {6*^}, for which first- and second-order moment summaries approximate 
adjusted expectation and variance for a Bayes linear analysis. We do not describe here an al- 
ternative regression adjustment method where the summary statistics are rather considered 
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as the dependent variables and the parameters as the independent variables of the regression 
(Leuenberger and Wegmann 2010). 



2.2 Best subset selection methods 

Best subset selection methods are conceptually simple, but are cumbersome to manage for 
large numbers of potential summary statistics, s = {si, . . . , Sp). Exhaustive enumeration 
of the TP — \ possible combinations of summary statistics is practically infeasible beyond 
a moderate value of p. This is especially true of Markov chain Monte Carlo or sequential 
Monte Carlo based analyses, which require one sampler implementation per combination. As 
a result, stochastic or deterministic (greedy) search procedures, such as forward or backward 
selection, are required to implement them. 

2.2.1 A sufficiency criterion 

The first principled approach to dimension reduction in ABC was the e-sufficiency concept 



proposed by Joyce and Marjoram (2008) , which was used to determine whether to include an 
additional summary statistic, s^, to a model already containing statistics Si, . . . , s^-i. Here, 
noting that the difference between the log likelihoods of p(si, . . . , S}^ff) and p(si, . . . , sa;_i|^) 



is logp(sj!c|si, . . . , Sfc_i, 6*), Joyce and Marjoram (2008) defined the set of statistics Si, . . . , 



-1 



to be ^-sufficient relative to if 

4 = suplogp(sfc|si, . . . ,Sfc_i,6') - inf logp(sfc|si, . . . ,Sfc-i,6') < e. (6) 

Accordingly, if an estimate of 5/,. (i.e. the "score" of relative to Si,...,Sfc_i) is greater 
than £, then there is enough additional information content in to justify including it in 



the model. In practice, Joyce and Marjoram (2008) implement a conceptually equivalent 
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assessment, whereby Sk is added to the model if the ratio of posteriors 

Pabc{6\si, . . . , Sk-i, Sk) 



Rk{0) 



PABc{d\si, ... , Sk-l] 



differs from one by more than some threshold value T{6) for any value of 6. As such, a 
statistic Sk will be added to the model if the resulting posterior changes sufficiently at any 
point. The threshold, T{9), is user-specified, with one particular choice described in Section 



5 of Joyce and Marjoram (2008) 



This procedure can be implemented within any stepwise search algorithm, each of which 
have various pros and cons. Following the definition ([6]), the resulting optimal subset of sum- 
mary statistics is then e-sufficient relative to each one of the remaining summary statistics. 
Here e intuitively represents an acceptable error in determining whether Sk contains further 
useful information in addition to Si, . . . , s^. This quantity is also user-specified, and so the 
final optimal choice of summary statistics will depend on the chosen value. 

Sensitivity to the choice of e aside, this approach may be criticised in that it assumes 
that every change to the posterior obtained by adding a statistic, Sk, is beneficial. It is 
conceivable that attempting to include a completely non-informative statistic (in an exact 
sufficiency sense) that is strongly inconsistent with the model, will result in a sufficiently 
modified posterior as measured by e, but one which is more biased away from the true 
posterior p{6\yobs) than without including Sk. A toy example illustrating this was given by 



Sisson and Fan (2011) 



A further criticism is that the amount of computation required to evaluate Rk{0) for 



all ^, and on multiple occasions, is considerable, especially for large q. In practice, Joyce 



and Marjoram (2008) considered 6 to be univariate, and approximated continuous 6 over a 
discrete grid in order to keep computational overheads to acceptable levels. As such, this 
method appears largely restricted to dimension reduction for univariate parameters (g = 1). 
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2.2.2 An entropy criterion 



Nunes and Balding (2010) propose the entropy of a distribution as a heuristic to measure the 
informativeness of candidate combinations of summary statistics. Since entropy measures 



information and a lack of randomness (Shannon and Weaver 1948), the authors propose min- 



imising the entropy of the approximate posterior, PABc{(^\sobs), over subsets of the summary 
statistics, s, as a proxy for determining maximal information about a parameter of interest. 
High entropy results from a diffuse posterior sample, whereas low entropy is obtained from 
a posterior which is more precise in nature. 

Nunes and Balding (2010)] estimate entropy using the unbiased fc-th nearest neighbour 
estimator of Singh et al. (2003) For a weighted posterior sample, {w^,9^), . . . , (ly", 6'"), 



where ^jW* = 1, this estimator can be written as 



E = lo£ 



q/2 



Lr(g/2+l) 



-ij{k) + logn + q^w'logC^^{k/{n-l)), 



(7) 



i=l 



where q = dim(^), ?/'(x) = r'(a;)/r(a;) denotes the digamma function, and where C*j(') 
denotes the empirical distribution function of the Euclidean distance from 6'* to the remainder 
of the weighted posterior sample i.e. of the weighted samples {{w^ , \\9^ — 9m)}j^i, where 
= / ^j-^iWK Following Singh et al. (2003), the original work of Nunes and Balding 



(2010) used A; = 4, and was based on an equally weighted posterior sample (i.e. with 
= l/n,i = 1, . . . ,n), so that C^^{k/{n — 1)) denotes the Euclidean distance from ^* to 

its fc-th closest neighbour in the posterior sample {^^, . . . , 6'*"^, 6'*"'"^, . . . , 9^}. 

While minimum entropy could in itself be used to evaluate the informativeness of a vector 



of summary statistics for 9 (although see the criticism of entropy below), Nunes and Balding 



(2010) propose a second stage to their analysis, which aims to assess the performance of a 
candidate set of summary statistics using a measure of posterior error. For example, when 
the true parameter vector, 9truei is known, the authors suggest the root sum of squared errors 
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(RSSE), given by 

|2 



RSSE= ^J^W'W -^truel 
,i=l 



where the measure \\6^ — OtmeW compares the components of ^ on a suitable scale (and so 
some componentwise standardisation may be required). Naturally, the true parameter value, 
Otrue, is unknown in practice. However, if the simulated summary statistics from the samples 
(6'*,s*) are treated as observed data, it is clear that Otme = for the posterior Pabc{^\s^)- 
As such, the RSSE can be easily computed with a leave-one-out technique. 

As the subset of summary statistics that minimises ([8| will likely vary over observed 



datasets, s\ Nunes and Balding (2010) propose minimising the average RSSE over some 



number of simulated datasets which are close to the observed, Sobs- To avoid circularity. 



Nunes and Balding (2010) define these "close" datasets to be the j = 1, . . . ,n* simulated 
datasets, {s-^}, that minimise — smb||, where s]^^^ and sme are the vectors of minimum 
entropy summary statistics computed via ([t]) from and the observed summary statistics. 
Sobs, respectively. That is, the quantity 

RSSE= — ^RSSEj, (9) 

is minimised (over subsets of summary statistics), where RSSEj corresponds to ^ using the 
simulated dataset 

This approach is intuitive, and is attractive because the second stage directly measures 
error in the posterior with respect to a known truth, Otme, which is not typically considered in 
other ABC dimension reduction approaches, albeit at the extra computational expense of a 
two-stage procedure. A weakness of the first stage however, is the assumption that addition 
of an informative statistic will reduce the entropy of the resulting posterior distribution. An 
example of when this does not occur is when the posterior distribution is diffuse with respect 
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to the prior - for instance, if an overly precise prior is located in the distributional tails of 



the posterior (e.g. Jeremiah et al. 2011). In this case, attempting to include an informative 
additional statistic, Sk, can result in a distribution that is more diffuse than with Sk excluded. 
As such, the entropic approach is therefore mostly suited to models with relatively diffuse 



prior distributions. However, it is clear that the overall approach of Nunes and Balding 



(2010) could easily be implemented with alternative first-stage selection criteria. 



2.2.3 AIC and BIC criteria 

Information criteria based on Akaike and Bayesian information are natural best subset selec- 
tion techniques for summary statistic dimension reduction in ABC analyses. We introduce 



and develop these criteria in Section 3.1 



2.3 Projection techniques 

Selecting a best subset of summary statistics from s = (si, . . . , Sp) suffers from the problem 
that it may require several statistics to provide the same information content as a single, 
highly informative statistic that was not specified in the initial set, s. To avoid this, projec- 
tion techniques aim to combine the elements of s through linear or non-linear transforma- 
tions, in order to construct a potentially much lower dimensional set of highly informative 
statistics. 

One of the main advantages of projection techniques is that, unlike best subset selection 
methods, they scale well with increasing numbers of summary statistics. They can handle 
large numbers of possibly uninformative summary statistics, in addition to accounting for 
high levels of inter-dependence and multicolinearity. A minor disadvantage of projection 
techniques is that the final sets of projected summary statistics typically (but not universally) 
lack interpretability. In addition, most projection methods require the specification of a 
hyperparameter that governs the number of projections to perform. 
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2.3.1 Partial least squares regression 



Partial least squares regression seeks the orthogonal linear combinations of the explanatory 
variables which have high variance and high correlation with the response variable (e.g. 



Boulesteix and Strimmer 2007; Vinzi et al. 2010 Abdi and Williams 2010). Wegmann 



et al. (2009) proposed the use of partial least squares regression for dimension reduction in 



ABC, where the explanatory variables are the suitably (e.g. Box-Cox) transformed summary 
statistics, s, and the response variables is the parameter vector, 6. 

The output of a partial least squares analysis is the set of k orthogonal components of 
the regression design matrix 



X 



1 



(10) 



1 . . . 



that are optimally correlated (in a specific sense) with 6. Here, s* denotes the j component 



of the i*^^ simulated summary statistic, s*. To choose the appropriate number of orthogonal 



components, Wegmann et al. (2009) examine the root mean square error of 6 for each 



value of k, as estimated by a leave-one-out cross validation strategy. For a fixed number of 
components, k, this corresponds to 



RMSE. 



n ^ 

> 1=1 



1/2 



yi ||2 



:ii) 



where mf^\s) denotes the mean response of the partial least squares regression, estimated 



without the i-th simulated summary statistic, (e.g. Mevik and Cederkvist 2004). The 



optimal number of components is then chosen by inspection of the RMSE^ values, based on 



minimum gradient change arguments (e.g. Mevik and Wehrens 2007). 

A potential disadvantage of partial least squares regression, as performed by [Wegmanii 
et al. (2009)1 is that it aims to infer a global linear relationship between 6 and s based on 
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draws from the prior predictive distribution, p{s\6)p{6) . This may differ from the relationship 
observed in the region around s = Sobs, and as such may produce unsuitable orthogonal 



components as a result. A workaround for this would be to follow Fearnhead and Prangle 



(2012) (see Section 2.3.3) and elicit the relationship between 6 and s based on samples from 
a truncated prior {9\ s*) ~ p{s\9)p{9)I{9 G 6"^), where C restricts the samples, 9\ to 
regions of significant posterior density. 

2.3.2 Neural networks 

In the regression setting, feed-forward neural networks can be considered as a non-linear 



generalisation of the partial least squares regression technique described above. Blum and 



Frangois (2010) proposed the neural network as a machine learning approach to dimension 
reduction by estimating the conditional mean and variance functions, m(-) and a\-) in the 
non-linear, heteroscedastic regression adjustment model ^ - see Section [2T 



The neural network reduces the dimension of the summary statistics to H < p, using H 
hidden units in the network, zi, . . . , zh, defined as 



.fc=i 



for j = 1, . . . , H. The u^} terms are the weights of the first layer of the neural network, and 
h{-) is a non-linear function, typically the logistic function. The reduced and non-linearly 
transformed summary statistics of the hidden units, zj, are then combined through the 
regression function of the neural network 



H 



m 



:s)=9[J2ujfz,+uji'^], (13) 



where oof"^ denotes the weights of the second layer of the neural network and g{-) is a 
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link function. A similar neural network is used to model logcr^(s) (e.g. Nix and Weigend 



1995), with the possibility of allowing for a different number of hidden units to estimate 



heteroscedasticity in the regression adjustment compared to that in the mean function m{s). 



Rather than dynamically determining the number of hidden units H, Blum and Frangois 



(2010) propose to specify a fixed value, such as if = g where q = dim(6') is the number of 
parameters to infer. The weights of the neural network are then obtained by minimising the 
regularised least-squares criterion 



i=l 



where u is the vector of all weights in the neural network model for m{s), = K^{\\s^ — SotsW) 
is the weight of the sample {6\ s*) ~ p{s\0)p{d) , and A > denotes the regularisation param- 
eter (termed the weight-decay parameter for neural networks). The idea of regularisation is 
to shrink the weights towards zero so that only informative summary statistics contribute in 



the model (12) and (13) for m{s). Following the estimation of m(s), a similar regularisation 
criterion is used to estimate logcr^(s). Both mean and variance functions can then be used 
in the regression adjustment of equation (|5|. 

2.3.3 Minimum expected posterior loss 



Fearnhead and Prangle (2012) proposed a decision-theoretic dimension reduction method 
with a slightly different aim to previous dimension reduction approaches. Here, rather than 
constructing appropriate summary statistics to ensure that PABci^^l^obs) ~ pi^^lUobs) is a good 
approximation, pABc{(^\sobs) is alternatively required to be a good approximation in terms of 
the accuracy of specified functions of the model parameters. In particular, assuming that in- 
terest is in point estimates of the model parameters, if Otme denotes the true parameter value 
and 6 an estimate, then 



Fearnhead and Prangle (2012) propose to choose those summary 
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statistics that minimise the quadratic loss 



L{9true, 0) = {Otrue — 



true 



for some p x p positive-definite matrix A. This loss is minimised for Sobs = Ep(e\yobs)i^)i 
true posterior mean. 



To estimate Ep(^0\y){9)^ Fearnhead and Prangle (2012) propose least-squares regression 



models for the /c = 1, . . . , g model parameters, (6'i, . . . , 6^^), given by 

ei = E,^e\y){e,) + vl = c^k + Pjfis^) + Vl (14) 

where (6'*, s*) ~ p{s\0)p{0) are draws from the prior predictive distribution, and Pk are 
unknown regression parameters to be estimated, and r]l denotes a zero-mean noise process. 
Here f{s) is a vector of potentially non-linear transformations of the data (i.e. of the original 



use 



summary statistics). For example, in one application, Fearnhead and Prangle (2012) 
the polynomial basis functions f{s) = (s, s^, s^, s^); that is, a vector of length 4p, where 
p = dim(s) is the number of elements in s, consisting of the first four powers of each element 
of s. The choice of f{s) can be based on standard diagnostics of regression fit, such as BIC. 
If the prior 7[{9) is diffuse with respect to the posterior, then one may estimate the regression 



model (14) based on samples from a truncated prior {6\s^) ~ p{s\6)p{6)I{6 G G"^), where 
qR ^ restricts the samples, 6^, to regions of significant posterior density. 



After fitting equation (14), the new, single summary statistic for the parameter 9k is 
f3j f{s), where f3k denotes the least squares estimate of (3k. The resulting g-dimensional 



vector of new summary statistics is then used in a standard ABC analysis. Fearnhead 



and Prangle (2012) show that these new statistics can lead to posterior inferences that 



considerably outperform inferences based on the original statistics, s. Nott et al. (2012) 



demonstrate that these summary statistics can be viewed as Bayes linear estimates of the 
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posterior mean. 



2.4 Regularisation approaches 

Regularisation approaches aim to reduce overfitting in a model by penalising model com- 
plexity. A simple example where overfitting can occur in ABC is the standard regression 



adjustment (Beaumont et al. 2002 Section 2.1), where there is a risk of over adjusting the 



parameters, in the direction of uninformative summary statistics via ([s]). Regularisation 
is used as part of the estimation of the neural network weights in the projection technique 



proposed by Blum and Frangois (2010) (see Section 2.3.2). As such, the regression adjust 



ment of Beaumont et al. (2002) is a procedure that could greatly benefit from the inclusion of 



regularisation techniques. We introduce the ridge regression adjustment to ABC in Section 



2.5 Other methods 



Drovandi et al. (2011) proposed to adopt ideas from indirect inference (e.g. Heggland and 



Frigessi 2004 Jiang and TurnbuU 2004) as a means to identify summary statistics for an 



ABC analysis. This involves specification of a model p{-\0) which is similar to p{-\0), but 
which is computationally tractable. The idea is that estimates of 6 under p{-\6), such as 
maximum likelihood estimates or posterior means, are likely to be informative about 6 if 
p{-\0) and p{-\0) are sufficiently similar. This approach can be considered similar in spirit to 



that of Fearnhead and Prangle (2012) which uses estimated posterior means under a pilot 



ABC analysis (see Section 2.3.3) 



Blum (2010b) proposed a Bayesian criterion related to the BIC (see Section 3.1 ) as a best 



subset selection procedure. The idea is to implement a Bayesian analysis of the standard 
regression adjustment model (|3|. The criterion, called the evidence approximation, seeks 
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the best subset of summary statistics to regress the parameter 6. In comparison to the BIG, 
the evidence criterion is attractive because it contains no approximation in its derivation. 
However, the downside is that its computation requires the tuning of the Bayesian hnear 
regression hyperparameters. 

Finally, a number of recent ABC modelling approaches have attempted to find ways of 
accurately handling the full vector of initial statistics, s, (or the full dataset, s = S{y) = y) 



thereby avoiding the need to perform dimension reduction. Bonassi et al. (2011) propose 
fitting a (p + g)-dimensional mixture of Gaussian distributions to the sample (^*, s*) ~ 
p{s\9)p{9), i = 1, . . . ,n, and then find the distribution of 6\sobs by conditioning on observing 
s = Sobs- This approach potentially requires a large number of mixture components to 



accurately model the joint density when {p + q) is large. Fan et al. (2012) suggest using an 
approximation to p{s\9) by approximating each marginal likelihood function, p{si\9), using a 
mixtures of experts model, where the weights, mean, and variance of each mixture component 
is allowed to depend on 9, and then inducing dependence between these marginals using a 
mixture of multivariate Gaussian distributions. This approach requires continuous summary 
statistics for the mixture regression, and is practically useful for moderate p (i.e. hundreds 



of summary statistics). Writing y = {yi, . . . ,yp), Barthelme and Ghopin (2011) propose to 
factorise the likelihood as p{y\9) = Y[iP{yi\yi:i-i,9) and construct an ABG approximation 
of each component in turn (i.e. PABciVilyi-.i-i) = J KeiWVi - yobs,i\\)Piyi\yi--i-i^^)dyi) with 
computation performed using an expectation-propagation algorithm ( [Minka 2001 ). This 
approach, while potentially fast and accurate, assumes that conditional simulation of yi ~ 
p{yi\yi:i-i, 9) is available for i = 1, . . . ,n, and so is not suitable for all models and analyses. 
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3 New dimension reduction methods 



In this Section, we introduce two new dimension reduction criteria for ABC methods. The 
first is a best subset selection procedure deriving from AIC and BIG criteria, constructed 
under implementation of the local linear model of equation ^ (Beaumont et al. 2002). 



The second is a modification to the fitting of ^ by considering ridge regression instead of 
least-squares regression. Both of these methods are now implemented in the freely available 



R package abc (Csillery et al. 2012) 



3.1 AIC and BIC criteria 

Akaike information criterion (AIC) and Bayesian information criterion (BIG) provide a mea- 
sure of the relative goodness of fit of a statistical model. Each can be expressed as the sum 
of the maximized log-likelihood that measures the fit of the model to the data, and a penalty 



for model complexity (Akaike 1974 Schwarz 1978). While evaluation of log p{yobs\(^mie) or 



log p{sobs\(^mie) IS Unavailable in the ABG framework, and determination of the maximum 
likelihood estimator, 9mie, is challenging, a simple and tractable likelihood function is avail- 
able though the local-linear regression model of equation ^ (Section 2.1). 



Specifically, we consider the local linear regression model equation (|2| of Beaumont et al. 



(2002) for each parameter 6'i, . . . , 9q, and assume independent Gaussian errors, ~ N{0, a- 



for j = 1, . . . , g. Then the AIG becomes 

9 

AIG = nlogJj6-| + 2ci, (15) 

where d = q{p + 1) is the number of estimated regression parameters, and n is the effective 
number of simulations used in the local-linear regression model, which we define as n = 
^r=i -^(^* > 0) when the kernel has compact support. Alternative definitions of the 
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effective number of simulations, such as c^"^^w* for some c > 0, can be on an arbitrary 
scale, since the least-squares regression solution is insensitive to the scale of the weights. 
For any fixed value of c, the value of c Y17=i "^ill decrease as p = dim(s) increases so 
that it will artificially favour larger numbers of (even uninformative) summary statistics. 
Our definition of h guarantees that the AIC scores are comparable for different subsets of 
summary statistics. A downside is that this definition of n is only suitable for kernels, K^, 
with a compact support. 



In equation ( 15 ), is defined as the weighted mean of squared residuals for the regression 
of 6j, and is given by 



where 6j is the j'*'* component of 6\ and rhj{s) denotes the estimate of the mean function 
mj{s) = E,[6j\s]. For small sample sizes, the corrected AIC, the so-called AICc, is given by 



replacing d in (15) by d{d + l)/{n — d — 1) (Hurvich and Tsai 1989). In the same manner 



the BIG can be defined as 

BIC = nlogJJ(jJ + dlogn. (16) 
Alternative penalty terms involving the hat matrix of the regression could also be used in 



the above (e.g. Hurvich et al. 1998; Irizarry 2001 Konishi et al. 2004). 



It is instructive to note that in using the linear regression adjustment (|3]), the above 
information criteria may be expressed as 

q 

xIC = h log Y\ Var(^*) + penalty term, 

i=i 

where 9* is the j*^ element of the regression adjusted vector 9* = {91, ... ,9*). As such, up 
to the penalty terms, both AIC and BIC seek the combination of summary statistics that 
minimizes the product of the marginal variances of the adjusted posterior sample. Similarly 
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to the entropy criterion of Nunes and Balding (2010) (see Section 2.2.2), these information 



criterion will select those summary statistics that maximise the precision of the posterior 



distribution, pABc{0\sobs)- However, unlike Nunes and Balding (2010) , this precision is traded 
off by a penalty for model complexity. 

A rationale for the construction of AlC and BIG in this manner, is that those summary 
statistics informative when performing the local linear regression of ^ on s are also the 
statistics that provide information when approximating p{9\yohs) with pABc{0\sohs)- The 
criteria are also simple to compute and interpret. However, an obvious requirement for AIC 
or BIG to identify an informative statistic is that the statistic varies (with 6) within the 
local range of the regression model. If a statistic is informative outside of this range, but 
uninformative within it, it will not be identified as informative under these criteria. 

3.2 Regularisation via ridge regression 



As described in Section 2.1 , the local-linear regression adjustment of Beaumont et al. (2002) 
fits the linear model 

6' = a + P^s' + e' 

based on the prior predictive samples {9\s^) ~ p{s\9)p{9), and with regression weights 
given by = K^{\\s^ — Sobs\\)- (As before, we describe the case where 9 is univariate 
for notational simplicity and clarity of exposition, but the approach outlined below can be 
readily implemented for each component of a multivariate 9.) However, in fitting the model 
by minimising the weighted least squares criteria, Yl^i=i w^\\<y — P~^s^ — there is a risk of 
over- adjustment by adjusting the parameter values via ([s]) in the direction of uninformative 
summary statistics. 

To avoid this, implicit dimension reduction within the regression framework can be per- 



formed by alternatively minimising the regularised weighted sum of squares ( Hoerl and Ken 
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nard 1970 ) 



1=1 



w 



(17) 



with regular isat ion parameter A > 0. As with the regularisation component within the neural 



network model of Blum and Frangois (2010) (Section 2.3.2), with ridge regression the risk 
of over- adjustment is reduced because the regression coefficients, /3, are shrunk towards zero 
by imposing a penalty on their magnitudes. 

An additional advantage of ridge regression is that standard least squares estimates, 
{o-Ls-, I^LsY = {X^W X)^'^X^WQ, are not guaranteed to have a unique solution. Here X 



is a n X (p + 1) design matrix given in equation (10), O = {9^, . . . , 9^) is the n x 1 column 
vector of sampled 6'*, and W = diag(w^, . . . , w^) is an n x n diagonal matrix of weights. The 
lack of a unique solution can arise through multicolinearity of the summary statistics, which 
can result in singularity of the matrix X^WX. In contrast, minimisation of the regularised 



weighted sum of squares (17) has always a unique solution, provided that A > 0. This 
solution is given by (aridge, /3ridgc)^ = {X^WX + \Ip)~^X~^WQ, where Ip denotes the p x p 
identity matrix. There are several approaches for dealing with the regularisation parameter 
A, including cross-validation and generalised cross-validation to identify an optimal value of 



A (Golub et al. 1979), as well as averaging the regularised estimates (ciridge, /Sridge)^ obtained 



for different values of A (Taniguchi and Tresp 1997). 



4 A comparative analysis 



We now provide a comparative analysis of the previously described methods of dimension 
reduction within the context of three previously studied analyses in the ABC literature. 



Specifically, this includes the analysis of a coalescent model with recombination (Joyce and 



Marjoram 2008), an evaluation of the evolutionary fitness cost of mutation in drug-resistant 



tuberculosis (Luciani et al. 2009), and an assessment of the number and size-distribution of 
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particle inclusions in the production of clean steels (Bortot et al. 2007). 

Each analysis is based on n = 1, 000, 000 simulations where the parameter 9 is drawn 
from the prior distribution p{9). The performance of each method is measured through 



the RSSE criterion ([9j) following Nunes and Balding (2010) , based on the same randomly 
selected subset of n* = 100 samples {9\s^) = {9true,Sobs) as 'observed' datasets. When 
evaluating the RSSE error measure of equation ([s]), we give a weight = 1 for the accepted 
simulations and a weight of otherwise. As the value of the RSSE ([s]) depends on the 
scale of each parameter, we standardise the parameters in each example by dividing the 
parameter values by the standard deviation obtained from the n = 1, 000, 000 simulations 
(with the exception of the first example, where the parameters are on similar scales). For 



comparative ease, and to provide a performance baseline for each example, all RSSE results 



are presented as relative to the RSSE obtained when using the maximal vector of summary 



statistics and no regression adjustment. In this manner, a relative RSSE of x/—x denotes 
an x% worsening/ improvement over the baseline score. 

Within each ABC analysis, we use Euclidean distance within an Epanechnikov kernel 
/Tedls — Sobs ID- The Euclidean distances are computed after standardizing the summary 
statistics with a robust estimate of the standard deviation (the mean absolute deviation). 
The kernel scale parameter, e, is determined as the value at which exactly 1% of the simu- 
lations, (6'*, s*), have non-zero weight. This yields exactly h = 10,000 simulations that form 



the final sample from each posterior. To perform the method of Fearnhead and Prangle 



(2012) , a randomly chosen 10% of the n simulations were used to fit the regression model 



that determines the choice of summary statistics, with the remaining 90% used for the 
ABC analysis. The final ABC sample size n = 10, 000 was kept equal to the other meth- 



ods by slightly adjusting the scale parameter, e. In addition for the method of Fearnhead 



and Prangle (2012), following exploratory analyses, the regression model (14) was fitted 



using f{s) = (s, s , s , s ) for Examples 1 and 2 (as described in Section 2.3.3) and using 
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/(s) = (logs, [logs]^, [logs]^, [logs]^) for Example 3 resulting in always 4 x p independent 



variables in the regression model of equation ( 14 ) . 



When using neural networks or ridge regression to estimate the conditional mean and 
variance, m(s) and cr^(s), we take the pointwise median of the estimated functions obtained 
with the regularisation parameters A = 10^'^, 10^^ and 10~^. These values of A assume 
that the summary statistics and the parameters have been standardized before fitting the 
regression function ( Ripley 1994[ ). However, because the optimisation procedure for neural 
networks (the R function nnet) only finds local optima, in this case we take the pointwise 
median of ten estimated functions, with each optimisation initialised from a different random 
starting point, and randomly choosing the regularisation parameter with equal probability 



from the above values (see Taniguchi and Tresp 1997). 



4.1 Example 1: A coalescent analysis 



This model was previously considered by Joyce and Marjoram (2008) and Nunes and Balding 



(2010) , each while proposing their respective ABC dimension reduction strategies (see Sec- 



tions 2.2.1 and 2.2.2). The analysis focuses on joint estimation of the scaled mutation rate. 



6, and the scaled recombination rate, p, in a coalescent model with recombination (Nordborg 



2007). Under this model, 5,001 basepair DNA sequences for 50 individuals are generated 



from the coalescent model, with recombination, under the infinite-sites mutation model, us- 
ing the software ms (Nordborg 2007). The initial summary statistics, s = (si, . . . ,Si), are 



the number of segregating sites (si), a t/(0,25) random variable (^2), the pairwise mean 
number of nucleotidic differences (53), the mean across pairs separated by < 10% of the 
simulated genomic regions (54), the number of distinct haplotypes (55), the frequency of the 
most common haplotype (se), and the number of singleton haplotypes (57). The statistic S2 
is uninformative for either parameter. 

We first examine the performance of the statistics without using dimension reduction 
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techniques. Table [ij shows the relative RSSE when using a single summary statistic and no 
regression adjustment (centre columns), and when using all six population genetic statistics 
(si and S3-S7) under no, homoscedastic and heteroscedastic regression adjustment (right 
columns). Results are shown for different parameter combinations, 6,p and {6,p). When 



estimating 9 only, we find that the pure rejection method achieves a lower relative RSSE only 
when using the number of segregating sites (si). All other single statistic and parameter 
combinations produce substantially worse than baseline performance. For all inferences (i.e. 
of 6, p and {6,p) jointly), regression adjustments generally improve the inference achieved 



under all six population genetic statistics, which is consistent with previous results (Nunes 



and Balding 2010 ) . The only exception is when jointly estimating {6, p), where homoscedastic 



linear adjustment neither decreases nor increases the error obtained with the pure rejection 
algorithm. 

Next, we investigate the performance of each dimension reduction technique. Table |2] 



shows the relative RSSE obtained under each dimension reduction method, again for each 
parameter combination and type of regression adjustment. The performance achieved with 
AlC, AICc, or BIG is comparable to (i.e. the same or slightly better than) the result ob- 
tained when including all six population genetic statistics. The only case of a substantial 
improvement under AIC/BIC is when using homoscedastic adjustment for the joint estima- 
tion of {6, p) where the baseline error is decreased by 8% with AlC/BlC and is not decreased 
when considering all the summary statistics . When using the e-sufficiency criterion, we find 
that the performance is improved for the inference on 6 only. The only best subset selection 
method for dimension reduction that substantially and uniformly improves the performance 
of ABC posterior estimates is the entropy-based approach. For the projection techniques, 
all methods (partial least squares, neural nets, and minimum expected posterior loss) out- 
perform the baseline inference based on all six population genetics statistics, with a large 
performance advantage for partial least squares when estimating (6',p) jointly. By contrast. 
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ridge regression provides no improvement over the standard regression adjustment (the "All 
six" column). 

Based on these results, a loose performance ranking of the dimension reduction meth- 
ods can be obtained by computing, for each method, the mean (relative) RSSE over all 
parameter combinations 6*, p, and {6,p) using the heteroscedastic adjustment. The worst 
performers were ridge regression and the e-sufficiency criterion (with a mean relative RSSE 
of —3%). These are followed by the standard regression adjustment with all summary statis- 
tics (—5%) and the AIC/BIC, neural nets and the posterior loss method (—6%). The best 
performing methods are partial least squares (—10%), and the two-stage entropy based pro- 
cedure (—16%). 



4.2 Example 2: The fitness cost of drug resistant tuberculosis 

We now consider an example of Markov processes for epidemiological modeling. If a pathogen, 
such as Mycobacterium tuberculosis, mutates to gain an evolutionary advantage, such as an- 
tibiotic resistance, it is biologically plausible that this mutation will come at a cost to the 
pathogen's relative fitness. Based on a stochastic model to describe the transmission and 
evolutionary dynamics of Mycobacterium tuberculosis, and based on incidence and geno- 



typic data of the IS6110 marker, Luciani et al. (2009) estimated the posterior distribution 



of the pathogen's transmission cost and relative fitness. The model contained g = 4 free 
parameters: the transmission rate, a, the transmission cost of drug resistant strains, c, the 
rate of evolution of resistance, p, and the mutation rate of the IS6110 marker, p. 



Luciani et al. (2009) summarised information generated from the stochastic model 
through p = 11 summary statistics. These statistics were expertly elicited as quantities 
that were expected to be informative for one or more model parameters, and included the 
number of distinct genotypes in the sample, gene diversity for sensitive and resistant cases, 
the proportion of resistant cases and measures of the degree of clustering of genotypes etc. 
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It is considered likely that there is dependence, and potentially replicate information within 
these statistics. 

As before, we examine the relative performance of the statistics without using dimension 
reduction techniques. Table |3] shows that for the univariate analysis of c, p, or fi, performing 
rejection sampling ABC with a single, well chosen summary statistic, can provide an im- 
proved performance over a similar analysis using all 11 summary statistics, under any form of 
regression adjustment. In particular, the statistic S2 (the proportion of isolates that are drug 
resistant) is the individual statistic which is most informative to estimate c (with a relative 



RSSE of -7%) and p (-9%). For the marker mutation rate, p, the most informative statistic is 



si, the number of distinct genotypes, with a relative RSSE of -14%. Conversely, an analysis 
using all summary statistics with a regression adjustment offers the best inferential perfor- 
mance for a alone, or for {a,c,p,p). These results provide support for recent arguments 



in favour of "marginal regression adjustments," (Nott et al. 2011) whereby the univariate 
marginal distributions of a full multivariate ABC analysis are replaced by separately esti- 
mated marginal distributions using only statistics relevant for each parameter. Here, more 
precisely estimated margins can improve the accuracy of the multivariate posterior sample, 
beyond the initial analysis. 

The performance results of each dimension reduction method are shown in Table |4| In 
contrast with the previous example, here the use of the AIC/BIC criteria can substan- 
tially decrease posterior errors. For example, compared to the linear adjustment of all 11 
parameters, which produces a mean relative RSSE of -5% (Table [s]), using the AIC/BIC 



criteria with homoscedastic adjustment results in a relative RSSE of between -15% and - 
18%. The e-sufficiency criterion produces more equivocal results, however, as the error is 
sometimes increased with respect to baseline performance (e.g. +6% when estimating a 
with homoscedastic adjustment) and sometimes reduced (e.g. —8% for c, p and 9 with het- 
eroscedastic adjustment). As with the previous example, the entropy criterion provides a 
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clear improvement to the ABC posterior, and this improvement is almost comparable to 
that produced by AIC/BIC. Finally, the projection and regularisation methods mostly all 
provide comparable and substantive improvements compared to the baseline error, with only 
partial least squares producing more equivocal results (e.g. +1% when estimating p with 
homoscedastic adjustment). 

Based on these results, the loose performance ranking of the dimension reduction methods 
determines the worst performers to be the standard least-squares regression adjustment (with 
a mean relative RSSE of —5%), the e-sufficiency approach (—6%) and partial least squares 
(—8%). These are followed by ridge regression (—11%), neural networks and the posterior 
loss method (—12%). The best performing methods for this analysis are the two-stage 
entropy-based procedure (—15%) and the AIC/BIC criteria (—17%). 

In this example, it is interesting to compare the performance of the standard linear 
regression adjustment of all 11 summary statistics (mean relative RSSE of -5%; Table [Z]) 
with that of the ridge regression equivalent (mean relative RSSE of -11%; Table The 
increase in performance with ridge regression may be attributed to its more robust handling 
of multicolinearity of the summary statistics than under the standard regression adjustment. 
To see this. Figure [l] illustrates the relationship between the relative RSSE (|8| (again, relative 
to using all summary statistics and no regression adjustment), and the condition number of 
the matrix X^WX, for both the standard regression (top panel) and ridge regression (bottom 
panel) adjustments based on inference for {a,c,p,fi). The condition number of X^WX is 
given by K = a/ Amax/ Amin, where Amax and Amin are the largest and smallest eigenvalues of 
X^WX. Extremely large condition numbers are evidence for multicolinearity. 

Figure [l] demonstrates that for large values of the condition number (e.g. for k > 10^), 
the least-squares-based regression adjustment clearly performs very poorly. The region of 
K > 10® corresponds to almost 5% of all simulations, and for these cases the relative error 
is hugely increased to anywhere between 5% and 200%. In contrast, for ridge-regression. 
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the relative errors corresponding to k, > 10^ are not larger than the errors obtained for non- 
extreme condition numbers. This analysis clearly illustrates that, unlike ridge regression, the 
standard least-squares regression adjustment can perform particularly poorly when there is 
multicolinearity between the summary statistics. 



In terms of the original analysis of Luciani et al. (2009) which used all eleven summary 



statistics with no regression adjustment (although with a very low value for e), the above 
results indicate that a more efficient analysis may have been achieved by using a suitable 
dimension reduction technique. 

4.3 Example 3: Quality control in the production of clean steels 

Our final example concerns the statistical modelling of extreme values. In the production of 
clean steels, the occurrence of microscopic imperfections (termed inclusions) is unavoidable. 
The strength of a clean steel block is largely dependent on the size of the largest inclusion. 



Bortot et al. (2007) considered an extreme value twist on the standard stereological problem 



(e.g. Baddeley and Jensen 2004), whereby inference is required on the size and number of 
3-dimensional inclusions, based on data obtained from those inclusions that intersect with 
a 2-dimensional slice. The model assumes a Poisson point process of inclusion locations 
with rate parameter r > 0, and that the distribution of inclusion size exceedances above a 
measurement threshold of 5/im are drawn from a generalised Pareto distribution with scale 
and shape parameters a > and C,, following standard extreme value theory arguments (e.g. 



Coles 2001D . 

The observed data consist of 112 cross-sectional inclusion diameters measured above 5/im. 
The summary statistics thereby comprise 112 equally spaced quantiles of the cross-sectional 
diameters, in addition to the number of inclusions observed, yielding p = 113 summary 
statistics in total. The ordering of the summary statistics creates strong dependences between 



them, a fact which can be exploited by dimension reduction techniques. Bortot et al. (2007) 
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considered two models based on spherical or ellipsoidal shaped inclusions. Our analysis here 
focuses on the ellipsoidal model. 

By construction, the large number (2^^'^) of possible combinations of summary statistics 
means that best subset selection methods are strictly not practicable for this analysis, unless 
the number of summary statistics is reduced further a priori. In order to facilitate at least 
some comparison with the other dimension reduction approaches, for best subset selection 
methods only, we consider six candidate subsets. Each subset consists of the number of 
observed inclusions in addition to 5, 10, 20, 50, 75 or 112 empirical quantiles of the inclusion 
size exceedances (the latter corresponds to the complete set of summary statistics). Due 
to the extreme value nature of this analysis, the parameter estimates are likely to be more 
sensitive to the precise values of the larger quantiles. As such, rather than using equally 
spaced quantiles, we use a scheme which favours quantiles closer to the maximum inclusion 
and we always include the maximum inclusion. 

The relative RSSE obtained under each dimension reduction method is shown in Table 
[5| In comparison to an analysis using all 113 summary statistics (the "All 113" column), the 
best subset selection approaches do not in general offer any improvement. While the entropy 
based method provides a slight improvement, the relative RSSE under the e-sufficiency 
criterion is substantially worse (along with partial least squares). Of course, these results 
are limited to the few subsets of statistics considered, and it is possible that alternative 
subsets could perform substantially better. However, it is computationally untenable to 
evaluate this possibility based on exhaustive enumeration of all subsets. 

When using neural networks to perform the regression adjustment based on computing 
the pointwise median of the m(s) and (T^(s) estimates, obtained using varying regularisa- 
tion parameter values (see the introduction to Section |4]), the relative performance is quite 
poor (left hand side RSSE values in Table [s]). The mean relative RSSE is —13% for neural 
networks, compared to —40% for heteroscedastic least-squares regression. As an alterna- 
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tive approach, rather than averaging over the regularization parameter A, we rather choose 
the value of A G {10^'^, 10^^, 10^^} that minimises the leave-one-out error of 9 (equation 



(11)). This approach considerably improves the performance of the network (right hand side 
RSSE values in Table |5| with the mean relative RSSE improving to the same level as for 
heteroscedastic regression; —40%. Adopting the same procedure to determine the regular- 
isation parameter within ridge regression, there is also a mean gain in performance from 
—39% to —42%, although the joint parameter inference on (r, cr, ^) actually performs worse 
under this alternative approach. The variability in these results highlights the importance 
of making an optimal choice of the regularisation parameter for an ABC analysis. 

The minimum expected posterior loss approach performs particularly well here. This 
approach has also been shown to perform well in a similar analysis; that of performing 
inference using quantiles of a large number of independent draws from the (intractable) 



g-axid-k distribution (Fearnhead and Prangle 2012). 

The loose performance ranking of each of the dimension reduction methods finds that the 



worst performers are the ^-sufficiency criterion (with a mean relative RSSE of —16%) and 
partial least squares (—19%). Neural networks and AIC/BIC perform just as well as standard 
least squares regression (—40%), ridge regression slightly outperforms standard regression 
(—42%) and the entropy based approach is a further slight improvement at —44%. The clear 



winner in this example is the posterior loss approach with a mean relative RSSE of —58^ 



5 Discussion 

The process of dimension reduction is a critical and influential part of any ABC analysis. 
In this article we have provided a comparative review of the major dimension reduction ap- 
proaches (and introduced two new ones) in order to provide some guidance to the practitioner 
in choosing the most appropriate technique for their own analysis. A summary of the quali- 
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tative features of each dimension reduction method is shown in Table |6| and a comparison of 



the relative performances of each method (as measured by the mean relative RSSE) for each 
example, is illustrated in Figure [2j As with each individual example, we may compute an 
overall performance ranking of the dimension reduction methods, by averaging the mean rel- 



ative RSSE values over the examples. Performing worse, on average, than a standard least 
squares regression adjustment with no dimension reduction, is the e-sufficiency technique 



(with an overall mean relative RSSE of —8%) and partial least squares (—12%). Performing 
better, on average, than standard least squares regression (—17%) is ridge regression and 
neural networks (—19%) and AIC/BIC (—21%). In this study, the top performers, on aver- 
age, were the entropy-based procedure and the minimum expected posterior loss approach. 



with an overall mean relative RSSE of —25%. 

While being ranked in the top three, a clear disadvantage of the entropy based procedure 
and the AIC/BIC criteria is the quantity of computation required. This primarily occurs as 
best subset selection procedures require evaluation of all 2^ potential models. For examples 
1 and 2, a greedy algorithm was able to find the optimum solution in a reasonable time. This 
was not possible for example 3. Additionally in this latter case, for the subsets of summary 
statistics considered, the performance obtained by implementing computationally expensive 
methods of dimension reduction was barely an improvement over the computationally cheap, 
least squares regression adjustment. This raises the important point, that the benefits of 
performing potentially expensive forms of dimension reduction over, say, the simple linear 
regression adjustment, should be evaluated prior to their full implementation. We also note 



that the second stage of the entropy-based method (Section |2.2.2 ) targets minimisation of 
(|9|, the same error measure used in our comparative analysis. As such, this approach is 
likely to be numerically favoured in our results. 

The top ranked (ex aequo) minimum expected posterior loss approach particularly out- 
performs other dimension reduction methods in the final example (the production of clean 
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steels). In such analyses, with large numbers of summary statistics (here p = 113), non-linear 
methods such as neural networks may become overparametrised, and simpler alternatives 
such as least squares or ridge regression adjustment can work more effectively. This is nat- 
urally explained through the usual bias-variance tradeoff: more complex regression models 
such as neural networks reduce the bias of the estimate of m{s) (and optionally o"^(s)), but 
in doing so the variance of the estimate is increased. This effect can be especially acute for 



high-dimensional regression (Geman et al. 1992). 



Our analyses indicate that the original least squares, linear regression adjustment (Beau- 



mont et al. 2002) can sometimes perform quite well, despite its simplicity. However, the 



presence of multicolinearity between the summary statistics can cause severe performance 
degradation, compared to not performing the regression adjustment (see Figure [T]). In such 
situations, regularisation procedures such as ridge regression (e.g. Example 2, and Figure [T]) 
and projection techniques can be beneficial. 

However, an important issue with regularisation procedures, such as neural networks 
and ridge regression, is the handling of the regularisation parameter, A. The 'averaging' 
procedure that was used in the first two examples, proved quite suboptimal in the third, 
where a cross-validation procedure to select a single best parameter value produced much 
improved results. This problem can be particularly critical for neural networks with large 
numbers of summary statistics, p, as the number of network weights is much larger than p, 
and accordingly, massive shrinkage of the weights (i.e. large values of A) is required to avoid 
overfitting. 

The posterior loss approach produced the superior performance in the third example. In 
general, a strong performance of this method can be primarily attributed to two factors. 
Firstly, in the presence of large numbers of highly dependent summary statistics, the ex- 
tra analysis stage in determining the most appropriate regression model (14) by choosing 
/(s) through e.g. BIG diagnostics, affords the opportunity to reduce the complexity of the 
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regression in a simple and relatively low-parameterised manner. This was not a primary 
contributor in example 3, however, as the regression (equation ([l4))) was directly performed 
on the full set of 113 statistics. Given the benefits of using regularisation methods in this 
setting, it is possible that a ridge regression model would allow a more robust estimate of 
the posterior mean (as a summary statistic) as part of this process. Secondly, the posterior 
loss approach determines the number of summary statistics to be equal to the number of 
posterior quantities of interest - in this case, g = 3 posterior parameter means. This small 
number of derived summary statistics naturally allows more precise posterior statements to 
be made, compared to dimension reduction methods that produce a much larger number of 
equally informative statistics. Of course, the dimension advantage here is strongly related 
to the number of parameters (g = 3) and summary statistics (p = 113) in this example. 
However, it is not fully clear how any current methods of dimension reduction for ABC 
would perform for substantially more challenging analyses with considerably higher numbers 
of parameters and summary statistics. This is because the curse of dimensionality in ABC 



(Blum 2010a) has tended to restrict existing applications of ABC methods to problems of 
moderate parameter dimension, although this may change in the future. 

What is very apparent from this study, is that there is no single 'best' method of dimen- 
sion reduction for ABC. For example, while the posterior loss and entropy based methods 
were the best performers for example 3, AIC and BIC were ranked first in the analysis of 
example 2, and partial least squares outperformed the posterior loss approach in example 1. 
A number of factors can affect the most suitable choice for any given analysis. As discussed 
above, these can include the number of initial summary statistics, the amount of dependence 
and multicolinearity within the statistics, the computational overheads of the dimension re- 
duction method, the requirement to suitably determine hyperparameters, and sensitivity to 
potentially large numbers of uninformative statistics. 

One important point to understand is that all of the ABC analyses of this review were 
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performed using the rejection algorithm optionally followed by some form of regression ad- 
justment. While alternative, potentially more efficient and accurate methods of ABC pos- 
terior simulation exist, such as Markov chain Monte Carlo or sequential Monte Carlo based 
samplers, the computational cost of separately implementing such an algorithm 2^ times (in 
the case of best subset selection methods) means that such dimension reduction methods 
can become rapidly untenable, even for small p. The price of the benefit of using the more 
computationally practical, fixed large number of samples, is that decisions on the dimension 
reduction of the summary statistics will be made on potentially worse estimates of the pos- 
terior than those available under superior sampling algorithms. As such, the final derived 
summary statistics may in fact not be those which are most appropriate for subsequent use 
in e.g. Markov chain Monte Carlo or sequential Monte Carlo based algorithms. 

However, this price is arguably a necessity. It is practically important to evaluate the 
performance of any dimension reduction procedure in a given analysis. Here we used the 



mean relative RSSE (c.f. |9j) that is based on a leave-one-out procedure. When using a fixed, 
large number of samples, evaluation of such a performance diagnostic is entirely practicable, 
as no further model simulations are required. This idea is also relevant to methods of 
dimension reduction that have been proposed specifically for ABC model selection (e.g. 



Barnes et al. 2011), where a misclassification rate based on a leave-one-out procedure can 



serve as a comparative criterion. 
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One summary statistic (no adj.) 

Sl S2 S3 54 S5 Sq S-j 

-7 81 7 64 29 32 35 


Six summary statistics (si, S3-S7) 
no adj. homo adj. hetero adj. 
-3 -3 


P 


20 20 20 19 9 12 10 


-5 -4 


{d,p) 


7 26 11 23 9 12 13 


0-7 



Table 1: Relative RSSE for Example 1 (the coalescent analysis) without using regression- 
adjustment for different parameter combinations, 6,p and {0,p). Centre columns show relative 
RSSE using one summary statistic. Rightmost columns show relative RSSE using all six popu- 
lation genetic summary statistics (si-sy) under no, homoscedastic and heteroscedastic regression 
adjustment. All RSSE are relative to the RSSE obtained when using no regression adjustment with 
the six summary statistics (si, S3-S7). 
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-7 
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-7 
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-6 



Table 2: Relative RSSE for Example 1 (the coalescent analysis) for different parameter combi- 
nations, 9,p and {9,p), using each method of dimension reduction, and under no, homoscedastic 
and heteroscedastic regression adjustment. Columns denote no dimension reduction (All six), BIC, 
AIC, AICc, the e-sufficiency criterion (e-suff), the two-stage entropy procedure (Ent), partial least 
squares (PLS), neural networks (NNet), minimum expected posterior loss (Loss) and ridge regres- 
sion (Ridge). All RSSE are relative to the RSSE obtained when using no regression adjustment 
with the six summary statistics (si, S3-S7). 
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Table 3: Relative RSSE for Example 2 (drug resistant tuberculosis) without using regression- 
adjustment for different parameter combinations, a,c,p,p and {a,c,p,p). Centre columns show 
relative RSSE using one summary statistic. Rightmost columns show relative RSSE using all eleven 
summary statistics under no, homoscedastic and heteroscedastic regression adjustment. All RSSE 
are relative to the RSSE obtained when using no regression adjustment with all eleven summary 
statistics (si-sn). 
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Table 4: Relative RSSE for Example 2 (drug resistant tuberculosis) for different parameter com- 
binations, a,c,p,p and {a,c,p,p), using each method of dimension reduction, and under no, ho- 
moscedastic and heteroscedastic regression adjustment. Columns denote no dimension reduction 
(All 11), BIC, AlC, AlCc, the e-sufficiency criterion (e-suff), the two-stage entropy procedure (Ent), 
partial least squares (PLS), neural networks (NNet), minimum expected posterior loss (Loss) and 
ridge regression (Ridge). All RSSE are relative to the RSSE obtained when using no regression 
adjustment with all eleven summary statistics (si-sn). 
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The first value is found by integrating out the regularisation parameter whereas the second one is found 
by choosing an optimal regularisation parameter with cross-validation. 



Table 5: Relative RSSE for Example 3 (production of clean steels) for different parameter combina- 
tions, r, o", ^ and (r, a, using each method of dimension reduction, and under no, homoscedastic 
and heteroscedastic regression adjustment. Columns denote no dimension reduction (All 113), BIC, 
AIC, AICc, the e-sufficiency criterion (e-suff), the two-stage entropy procedure (Ent), partial least 
squares (PLS), neural networks (NNet), minimum expected posterior loss (Loss) and ridge regres- 
sion (Ridge). All RSSE are relative to the RSSE obtained when using no regression adjustment 
with all 113 summary statistics (si-sna) 



Class 

Best subset selection 
Projection techniques 



Method Hyper-parameter 

AIC/BIC None 

e-suflf T{e) 

Ent None 

PLS Number of PLS compoucnts, k 

NNet Regularisation parameter, A 

Loss Choice of basis functions 



Choice of hyper-parameter 
User choice 



Computational burden 

Substantial/greedy alg. 
Substantial/greedy alg. 
Substantial/greedy alg. 



Cross-validation Weak 
Integration or cross-validation Moderate (optimization algorithm) 
BIC Weak (closed-form solution) 



Regularisation Ridge Regularisation parameter, A Integration or cross-validation Weak (closed-form solution) 

Table 6: Summary of the main features of the different methods of dimension reduction for ABC. 
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Figure 1: Scatterplots of relative RSSE versus the condition number of the matrix X WX for 
hnear least-squares (top) and ridge (bottom) regression adjustments. Points are based on joint in- 
ference for (a, c, p, jx) using 1,000 randomly selected vectors of summary statistics, s', as "observed" 
data. When the minimum eigenvalue, Amim is zero, the (infinite) condition number is arbitrarily 
set to be 10^^ for visual clarity (open circles on the scatterplot) . 
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Figure 2: Mean relative RSSE values using each method of dimension reduction and for each 
example. Methods of dimension reduction include no dimension reduction (All), AIC/BIC, the 
e-sufficiency criterion (e-suff), the two-stage entropy procedure (Ent), partial least squares (PLS), 
neural networks (NNet), minimum expected posterior loss (Loss) and ridge regression (Ridge). 
For examples 1 and 2, the results for ridge regression and neural networks estimate m(s) and 
cr^(s) by taking the pointwise median curve over varying values of the regularisation parameter; 
A = 10^^, 10^^ and 10^^ (see introduction to Section[4]). For example 3, an optimal value of A was 
chosen based on a cross-validation procedure (see Section 4.3). 
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